# 1. Collect known small non-coding RNAs and other sequences to filter against

See the `Filters` subdirectory.

# 2. Perform mapping

Create an index of the genome for BWA:
```
bwa index /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa
```

## 2.1 Map the MiSeq knockdown and time course sequencing data

See the `MiSeq` subdirectory.

## 2.2 Map the HiSeq sequencing data

See the `HiSeq` subdirectory.

## 2.3 Map the CAGE sequencing data

See the `CAGE` subdirectory.

## 2.4 Map the StartSeq sequencing data

See the `StartSeq` subdirectory.

## 2.6 Find read lengths

Make an overview of the read length distribution for each data set:
```
python find_read_lengths.py 
```
yielding

| Data set | Median | Mean | Standard deviation | Minimum | Maximum |
| :--- | :---: | :---: | :---: | :---: | :---: |
| MiSeq knockdown & time course | 33 & 33 | 33 & 33 | 0 | 33 & 33 | 33 & 33 |
| HiSeq | 48 | 48 | 0 | 48 | 48 |
| CAGE  | 27 | 26 | 0.7 | 20 | 35 |
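
The script `find_read_lengths.py` is not reproduced here. As a minimal sketch of how the columns of this table could be computed, assuming the read lengths of one data set are already available as a list of integers (the function name `summarize_read_lengths` and the example lengths are hypothetical, not taken from the script):

```python
import statistics

def summarize_read_lengths(lengths):
    """Return summary statistics for a list of read lengths (in nucleotides)."""
    return {
        "median": statistics.median(lengths),
        "mean": statistics.mean(lengths),
        # Population standard deviation; the actual script may use the sample version.
        "stdev": statistics.pstdev(lengths),
        "minimum": min(lengths),
        "maximum": max(lengths),
    }

# Hypothetical example: every read has the same length, so the spread is zero.
summary = summarize_read_lengths([33, 33, 33, 33])
print(summary)
```

A standard deviation of 0 together with identical minimum and maximum, as in the MiSeq and HiSeq rows, indicates fixed-length reads.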


# 3. Perform annotations of spliced mature transcripts and precursor regions

## 3.1 MiSeq knockdown and time course data

Annotate all reads aligning across splice junctions to mature mRNAs, mature lncRNAs, or mature gencode transcripts by running the `annotate_mature_transcript_alignments.py` script on each MiSeq library:
```
python make_mature_transcript_annotation_scripts.py MiSeq
bash -x script.sh
```
This will run
```
python annotate_mature_transcript_alignments.py MiSeq <library>
```
on each MiSeq library.
This script adds an `XE:Z:spliced_mRNA`, `XE:Z:spliced_lncRNA`, `XE:Z:spliced_gencode`, `XE:Z:spliced:histone`, `XE:Z:spliced_MALAT1`, `XE:Z:spliced_TERC`, or `XE:Z:spliced_snhg` tag to each alignment to a mature mRNA, lncRNA, gencode, histone, MALAT1, TERC, or snoRNA host gene transcript, respectively, that spans a splice junction.
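
To spot-check the result, the `XE` tag can be read back from the optional fields of an alignment. The following is a minimal sketch, assuming the alignments are available as SAM text lines (for example as printed by `samtools view`); the helper `get_tag` and the example record are hypothetical and not part of the pipeline:

```python
def get_tag(sam_line, tag):
    """Return the value of an optional SAM tag as a string, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 columns of a SAM record are mandatory; tags follow as TAG:TYPE:VALUE.
    for field in fields[11:]:
        # Split at most twice, so that values containing a colon stay intact.
        name, _type, value = field.split(":", 2)
        if name == tag:
            return value
    return None

# Hypothetical alignment record spanning a splice junction (N in the CIGAR string).
record = "\t".join([
    "read1", "0", "chr1", "1000", "60", "20M500N13M", "*", "0", "0",
    "A" * 33, "I" * 33, "XE:Z:spliced_mRNA",
])
print(get_tag(record, "XE"))
print(get_tag(record, "XA"))
```

The same helper works for the `XA`, `XG`, `XD`, `NH`, and `HI` tags used in later steps, although it returns all values as strings.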

Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

Annotate all reads overlapping snRNA, tRNA, snoRNA, or scaRNA precursor regions:
```
python make_precursor_annotation_scripts.py MiSeq
bash -x script.sh
```
This will run
```
python annotate_precursor_regions.py MiSeq <library>
```
on each MiSeq library.
This adds an `XA:Z:presnRNA`, `XA:Z:pretRNA`, `XA:Z:presnoRNA`, or `XA:Z:prescaRNA` tag to alignments overlapping the corresponding precursor regions.


Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

Store the BAM files generated by this script:
```
mv ???_r?.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```

Add the NH tag with the number of multimapping locations, and the HI tag showing the index of each of these multimapping alignments:
```
python make_add_multimap_tag_scripts.py MiSeq
bash -x script.sh
```
This will run
```
python add_multimap_tag.py MiSeq <library>
```
on each library, and generate a new BAM file.
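
The bookkeeping behind the two tags can be sketched as follows. This is an illustration only, not the code of `add_multimap_tag.py`: alignments are represented as plain dictionaries rather than BAM records, the function name `add_multimap_tags` is hypothetical, and the 1-based numbering of `HI` is an assumption.

```python
from collections import Counter

def add_multimap_tags(alignments):
    """Attach NH (number of hits) and HI (hit index) to each alignment.

    `alignments` is a list of dicts with at least a "qname" key; this is a
    simplified stand-in for BAM records.
    """
    counts = Counter(a["qname"] for a in alignments)
    seen = Counter()
    tagged = []
    for a in alignments:
        seen[a["qname"]] += 1
        # HI is numbered from 1 here; this numbering is an assumption.
        tagged.append({**a, "NH": counts[a["qname"]], "HI": seen[a["qname"]]})
    return tagged

# Hypothetical example: read "r1" maps to two locations, read "r2" to one.
example = [
    {"qname": "r1", "chrom": "chr1", "pos": 100},
    {"qname": "r2", "chrom": "chr2", "pos": 200},
    {"qname": "r1", "chrom": "chr5", "pos": 300},
]
tagged = add_multimap_tags(example)
for a in tagged:
    print(a["qname"], a["NH"], a["HI"])
```

Uniquely mapped reads are then simply the alignments with `NH` equal to 1.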

Store the BAM files generated by this script:
```
mv ???_r?.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

## 3.4 HiSeq data

Annotate all reads aligning across splice junctions to mature mRNAs, mature lncRNAs, or mature gencode transcripts by running the `annotate_mature_transcript_alignments.py` script on each HiSeq library:
```
python make_mature_transcript_annotation_scripts.py HiSeq
bash -x script.sh
```
This will run
```
python annotate_mature_transcript_alignments.py HiSeq <library>
```
on each HiSeq library.
This script adds an `XE:Z:spliced_mRNA`, `XE:Z:spliced_lncRNA`, `XE:Z:spliced_gencode`, `XE:Z:spliced:histone`, `XE:Z:spliced_MALAT1`, `XE:Z:spliced_TERC`, or `XE:Z:spliced_snhg` tag to each alignment to a mature mRNA, lncRNA, gencode, histone, MALAT1, TERC, or snoRNA host gene transcript, respectively, that spans a splice junction.

Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

Annotate all reads overlapping snRNA, tRNA, snoRNA, or scaRNA precursor regions:
```
python make_precursor_annotation_scripts.py HiSeq
bash -x script.sh
```
This will run
```
python annotate_precursor_regions.py HiSeq <library>
```
on each HiSeq library.
This adds an `XA:Z:presnRNA`, `XA:Z:pretRNA`, `XA:Z:presnoRNA`, or `XA:Z:prescaRNA` tag to alignments overlapping the corresponding precursor regions.


Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

Store the BAM files generated by this script:
```
mv t??_r?.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
```

Add the NH tag with the number of multimapping locations, and the HI tag showing the index of each of these multimapping alignments:
```
python make_add_multimap_tag_scripts.py HiSeq
bash -x script.sh
```
This will run
```
python add_multimap_tag.py HiSeq <library>
```
on each library, and generate a new BAM file.

Store the BAM files generated by this script:
```
mv t??_r?.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

## 3.5 CAGE data

Annotate all reads aligning across splice junctions to mature mRNAs, mature lncRNAs, or mature gencode transcripts by running the `annotate_mature_transcript_alignments.py` script on each CAGE library:
```
python make_mature_transcript_annotation_scripts.py CAGE
bash -x script.sh
```
This will run
```
python annotate_mature_transcript_alignments.py CAGE <library>
```
on each CAGE library.
This script adds an `XE:Z:spliced_mRNA`, `XE:Z:spliced_lncRNA`, `XE:Z:spliced_gencode`, `XE:Z:spliced:histone`, `XE:Z:spliced_MALAT1`, `XE:Z:spliced_TERC`, or `XE:Z:spliced_snhg` tag to each alignment to a mature mRNA, lncRNA, gencode, histone, MALAT1, TERC, or snoRNA host gene transcript, respectively, that spans a splice junction.

Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

Annotate all reads overlapping snRNA, tRNA, snoRNA, or scaRNA precursor regions:
```
python make_precursor_annotation_scripts.py CAGE
bash -x script.sh
```
This will run
```
python annotate_precursor_regions.py CAGE <library>
```
on each CAGE library.
This adds an `XA:Z:presnRNA`, `XA:Z:pretRNA`, `XA:Z:presnoRNA`, or `XA:Z:prescaRNA` tag to alignments overlapping the corresponding precursor regions.


Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

Store the BAM files generated by this script:
```
mv ??_hr_?.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
```

Add the NH tag with the number of multimapping locations, and the HI tag showing the index of each of these multimapping alignments:
```
python make_add_multimap_tag_scripts.py CAGE
bash -x script.sh
```
This will run
```
python add_multimap_tag.py CAGE <library>
```
on each library, and generate a new BAM file.

Store the BAM files generated by this script:
```
mv ??_hr_?.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

## 3.6 StartSeq data

Annotate all reads aligning across splice junctions to mature mRNAs, mature lncRNAs, or mature gencode transcripts by running the `annotate_mature_transcript_alignments.py` script on each StartSeq library:
```
python make_mature_transcript_annotation_scripts.py StartSeq
bash -x script.sh
```
This will run
```
python annotate_mature_transcript_alignments.py StartSeq <library>
```
on each StartSeq library.
This script adds an `XE:Z:spliced_mRNA`, `XE:Z:spliced_lncRNA`, `XE:Z:spliced_gencode`, `XE:Z:spliced:histone`, `XE:Z:spliced_MALAT1`, `XE:Z:spliced_TERC`, or `XE:Z:spliced_snhg` tag to each alignment to a mature mRNA, lncRNA, gencode, histone, MALAT1, TERC, or snoRNA host gene transcript, respectively, that spans a splice junction.

Remove the intermediate files:
```
rm script.sh
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

Annotate all reads overlapping snRNA, tRNA, snoRNA, or scaRNA precursor regions:
```
python make_precursor_annotation_scripts.py StartSeq
bash -x script.sh
```
This will run
```
python annotate_precursor_regions.py StartSeq <library>
```
on each StartSeq library.
This adds an `XA:Z:presnRNA`, `XA:Z:pretRNA`, `XA:Z:presnoRNA`, or `XA:Z:prescaRNA` tag to alignments overlapping the corresponding precursor regions.


Remove the intermediate files:
```
rm script.sh
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

Store the BAM files generated by this script:
```
mv SRR707145?.bam /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/Mapping/
```

Add the NH tag with the number of multimapping locations, and the HI tag showing the index of each of these multimapping alignments:
```
python make_add_multimap_tag_scripts.py StartSeq
bash -x script.sh
```
This will run
```
python add_multimap_tag.py StartSeq <library>
```
on each library, and generate a new BAM file.

Store the BAM files generated by this script:
```
mv SRR707145?.bam /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

# 4. Predict novel enhancers

First, we create a mask to exclude all promoters and exonic regions from the enhancer prediction (using a ±500 bp window around annotated TSSs, and a ±200 bp window around exons):
```
cd Enhancers
python make_mask.py
cd ..
```
creating the file `mask/hg38_neg_filter_500_merged.bed`.
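
The idea behind the mask can be sketched as follows. This is not the code of `make_mask.py`: it handles a single chromosome, ignores strand, and treats intervals as 0-based and half-open, all of which are simplifying assumptions. The function names `merge_intervals` and `make_mask` are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent (start, end) intervals on one chromosome."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def make_mask(tss_positions, exons, tss_window=500, exon_window=200):
    """Build masked regions from TSS positions and (start, end) exons."""
    regions = [(max(0, tss - tss_window), tss + tss_window) for tss in tss_positions]
    regions += [(max(0, start - exon_window), end + exon_window) for start, end in exons]
    return merge_intervals(regions)

# Hypothetical example: one TSS and two exons on the same chromosome.
mask = make_mask(tss_positions=[1000], exons=[(1400, 1600), (5000, 5100)])
print(mask)
```

In this example, the window around the TSS and the window around the first exon overlap and are merged into one masked region, while the second exon yields a separate region.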

## 4.1 Predict novel enhancers from the MiSeq knockdown and time course data

First, prepare the MiSeq data for enhancer prediction:
```
python make_prepare_enhancer_prediction_scripts.py MiSeq
bash -x script.sh
```
This runs
```
python prepare_enhancer_prediction.py MiSeq <library>
```
on each MiSeq library, generating a CTSS file `MiSeq.<library>.ctss.bed`.

Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

Create a merged CTSS file:
```
cat MiSeq.t??_r?.ctss.bed | sortBed -i | mergeBed -s -d -1 -c 4,5,6 -o distinct,sum,distinct > MiSeq.ctss.bed
gzip MiSeq.ctss.bed
echo MiSeq.ctss.bed.gz > pathlist
```
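
As far as can be inferred from the options used, the `mergeBed` step adds up the counts of CTSS entries that share the same position and strand across libraries. A minimal sketch of that aggregation, assuming six-column BED records with the count in the fifth column (the function name `merge_ctss` and the example records are hypothetical):

```python
from collections import defaultdict

def merge_ctss(libraries):
    """Sum CTSS counts over libraries.

    Each library is a list of (chrom, start, end, name, count, strand) tuples,
    mirroring the six BED columns assumed for the CTSS files.
    """
    totals = defaultdict(int)
    for records in libraries:
        for chrom, start, end, _name, count, strand in records:
            # Entries are combined only if position and strand both match.
            totals[(chrom, start, end, strand)] += count
    return dict(sorted(totals.items()))

# Hypothetical example: two libraries sharing one CTSS on the + strand.
lib1 = [("chr1", 100, 101, "a", 3, "+"), ("chr1", 100, 101, "b", 1, "-")]
lib2 = [("chr1", 100, 101, "a", 2, "+")]
merged = merge_ctss([lib1, lib2])
print(merged)
```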

Then perform enhancer prediction:

```
Enhancers/scripts/bidir_enhancers -f pathlist -m Enhancers/mask/hg38_neg_filter_500_merged.bed -o enhancer_predictions
mv enhancer_predictions/enhancers.bed enhancer_predictions/enhancers.MiSeq.bed
```

Note that the `bidir_enhancers` script requires a candidate novel enhancer to be bidirectionally transcribed in at least one sample. Merging all samples and then predicting enhancers results in more predicted enhancers than predicting enhancers from the individual `MiSeq.<library>.ctss.bed` files before merging. Which strategy yields more reliable enhancers remains to be evaluated.

The predicted enhancers are stored in the subdirectory `enhancer_predictions`. Count the number of predicted enhancers:
```
wc enhancer_predictions/enhancers.MiSeq.bed
```
which reports 15 predicted enhancers.

Evaluate the overlap with FANTOM5 enhancers:
```
intersectBed -u -a enhancer_predictions/enhancers.MiSeq.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | wc
```
which reports that 8 MiSeq enhancers overlap FANTOM5 enhancers.

Evaluate the overlap with Roadmap Epigenomics enhancer regions:
```
intersectBed -u -a enhancer_predictions/enhancers.MiSeq.bed -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | wc
```
which reports that 4 MiSeq enhancers overlap Roadmap Epigenomics enhancer regions.

Find the predicted enhancers that do not overlap a FANTOM5 enhancer or a Roadmap Epigenomics enhancer region:
```
intersectBed -v -a enhancer_predictions/enhancers.MiSeq.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | intersectBed -v -a stdin -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | sortBed -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > enhancer_predictions/novel_enhancers.MiSeq.bed
wc enhancer_predictions/novel_enhancers.MiSeq.bed
```
which reports that 5 novel enhancers were found from the MiSeq data.
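
The logic of the two chained `intersectBed -v` calls can be sketched as follows: a predicted enhancer is kept only if it overlaps no region in either reference set. This is an illustration under simplifying assumptions (0-based half-open intervals, brute-force comparison); the function names and example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def find_novel(predicted, *known_sets):
    """Keep predicted regions that overlap none of the known regions."""
    return [
        p for p in predicted
        if not any(overlaps(p, k) for known in known_sets for k in known)
    ]

# Hypothetical example with one region per reference set.
predicted = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
fantom5 = [("chr1", 150, 250)]
roadmap = [("chr2", 50, 120)]
novel = find_novel(predicted, fantom5, roadmap)
print(novel)
```

Only the second predicted region survives, since the first overlaps the hypothetical FANTOM5 region and the third overlaps the hypothetical Roadmap Epigenomics region.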

## 4.4 Predict novel enhancers from the HiSeq data

First, prepare the HiSeq data for enhancer prediction:
```
python make_prepare_enhancer_prediction_scripts.py HiSeq
bash -x script.sh
```
This runs
```
python prepare_enhancer_prediction.py HiSeq <library>
```
on each HiSeq library, generating a CTSS file `HiSeq.<library>.ctss.bed`.

Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

Create a merged CTSS file:
```
cat HiSeq.t??_r?.ctss.bed | sortBed -i | mergeBed -s -d -1 -c 4,5,6 -o distinct,sum,distinct > HiSeq.ctss.bed
gzip HiSeq.ctss.bed
echo HiSeq.ctss.bed.gz > pathlist
```

Then perform enhancer prediction:
```
Enhancers/scripts/bidir_enhancers -f pathlist -m Enhancers/mask/hg38_neg_filter_500_merged.bed -o enhancer_predictions
mv enhancer_predictions/enhancers.bed enhancer_predictions/enhancers.HiSeq.bed
```

The predicted enhancers are stored in the subdirectory `enhancer_predictions`. Count the number of predicted enhancers:
```
wc enhancer_predictions/enhancers.HiSeq.bed
```
which reports 11307 predicted enhancers.

Evaluate the overlap with FANTOM5 enhancers:
```
intersectBed -u -a enhancer_predictions/enhancers.HiSeq.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | wc
```
which reports that 3881 HiSeq enhancers overlap FANTOM5 enhancers.

Evaluate the overlap with Roadmap Epigenomics enhancer regions:
```
intersectBed -u -a enhancer_predictions/enhancers.HiSeq.bed -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | wc
```
which reports that 7644 HiSeq enhancers overlap Roadmap Epigenomics enhancer regions.

Find the predicted enhancers that do not overlap a FANTOM5 enhancer or a Roadmap Epigenomics enhancer region:
```
intersectBed -v -a enhancer_predictions/enhancers.HiSeq.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | intersectBed -v -a stdin -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | sortBed -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > enhancer_predictions/novel_enhancers.HiSeq.bed
wc enhancer_predictions/novel_enhancers.HiSeq.bed
```
which reports that 2574 novel enhancers were found from the HiSeq data.

## 4.5 Predict novel enhancers from the CAGE data

First, prepare the CAGE data for enhancer prediction:
```
python make_prepare_enhancer_prediction_scripts.py CAGE
bash -x script.sh
```
This runs
```
python prepare_enhancer_prediction.py CAGE <library>
```
on each CAGE library, generating a CTSS file `CAGE.<library>.ctss.bed`.

Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

Create a merged CTSS file:
```
cat CAGE.??_hr_?.ctss.bed | sortBed -i | mergeBed -s -d -1 -c 4,5,6 -o distinct,sum,distinct > CAGE.ctss.bed
gzip CAGE.ctss.bed
echo CAGE.ctss.bed.gz > pathlist
```

Then perform enhancer prediction:
```
Enhancers/scripts/bidir_enhancers -f pathlist -m Enhancers/mask/hg38_neg_filter_500_merged.bed -o enhancer_predictions
mv enhancer_predictions/enhancers.bed enhancer_predictions/enhancers.CAGE.bed
```

The predicted enhancers are stored in the subdirectory `enhancer_predictions`. Count the number of predicted enhancers:
```
wc enhancer_predictions/enhancers.CAGE.bed
```
which reports 18167 predicted enhancers.

Evaluate the overlap with FANTOM5 enhancers:
```
intersectBed -u -a enhancer_predictions/enhancers.CAGE.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | wc
```
which reports that 3930 CAGE enhancers overlap FANTOM5 enhancers.

Evaluate the overlap with Roadmap Epigenomics enhancer regions:
```
intersectBed -u -a enhancer_predictions/enhancers.CAGE.bed -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | wc
```
which reports that 10455 CAGE enhancers overlap Roadmap Epigenomics enhancer regions.

Find the predicted enhancers that do not overlap a FANTOM5 enhancer or a Roadmap Epigenomics enhancer region:
```
intersectBed -v -a enhancer_predictions/enhancers.CAGE.bed -b /osc-fs_home/mdehoon/Data/Fantom5/Enhancers/F5.hg38.enhancers.bed.gz | intersectBed -v -a stdin -b /osc-fs_home/mdehoon/Data/RoadmapEpigenomics/enhancers.bed | sortBed -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > enhancer_predictions/novel_enhancers.CAGE.bed
wc enhancer_predictions/novel_enhancers.CAGE.bed
```
which reports that 6533 novel enhancers were found from the CAGE data.

# 5. Analyze the dependence of the number of predicted enhancers on sequencing depth

For both the HiSeq data and the CAGE data, create a single BAM file for all samples:

```
python initialize_enhancer_predictions_randomized.py HiSeq
python initialize_enhancer_predictions_randomized.py CAGE
```

This creates the BAM files `HiSeq.bam` and `CAGE.bam`, in which each alignment carries an `RX` tag holding an integer between 0 and the number of query sequences; these integers are assigned in a randomly permuted order.
Sort these alignments by this `RX` tag:
```
samtools sort -t RX HiSeq.bam -o HiSeq.sorted.bam
samtools sort -t RX CAGE.bam -o CAGE.sorted.bam
```
The resulting BAM files `HiSeq.sorted.bam` and `CAGE.sorted.bam` now contain all alignments in a random order.
Create CTSS files for n million sequence reads in the HiSeq and CAGE data:
```
python prepare_enhancer_prediction_randomized.py HiSeq
python prepare_enhancer_prediction_randomized.py CAGE
```
This creates files named `HiSeq.<n>000000.ctss.bed`, `HiSeq.674085062.ctss.bed`, `CAGE.<n>000000.ctss.bed`, `CAGE.191488291.ctss.bed`, where n is an integer.
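
The subsampling strategy described above (assign each read a random rank, sort by rank, then take the first n reads) can be sketched as follows. This is an illustration on read names only, not the code of the scripts above; the function names, the fixed seed, and the use of ranks from 0 to N - 1 are assumptions.

```python
import random

def assign_random_ranks(read_names, seed=0):
    """Map each read name to a unique integer rank in random order."""
    ranks = list(range(len(read_names)))
    # A fixed seed keeps the sketch reproducible; the real scripts may differ.
    random.Random(seed).shuffle(ranks)
    return dict(zip(read_names, ranks))

def subsample(read_names, ranks, n):
    """Return the n reads with the lowest rank, i.e. a random subset of size n."""
    return sorted(read_names, key=ranks.__getitem__)[:n]

# Hypothetical example with ten reads.
reads = [f"read{i}" for i in range(10)]
ranks = assign_random_ranks(reads)
subset_3 = subsample(reads, ranks, 3)
subset_5 = subsample(reads, ranks, 5)
print(subset_3)
print(subset_5)
```

One consequence of this design is that the subsets are nested: the reads used at a lower depth are always contained in the set used at a higher depth, so the predictions at increasing depths are based on a growing set of reads.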

Predict enhancers for each CTSS file:
```
python analyze_enhancer_prediction_randomized.py HiSeq
python analyze_enhancer_prediction_randomized.py CAGE
```
This creates the files `enhancers.<dataset>.<n>000000.bed`, `enhancers.HiSeq.674085062.bed`, and `enhancers.CAGE.191488291.bed` in the directory `enhancer_predictions_randomized`.

Create the figure:
```
python -i make_figure_enhancers_randomized.py
```
generating `figure_enhancer_predictions_randomized.png`.

Remove the intermediate files:
```
rm HiSeq.bam
rm CAGE.bam
rm HiSeq.sorted.bam
rm CAGE.sorted.bam
rm HiSeq.*[0-9].ctss.bed
rm CAGE.*[0-9].ctss.bed
```

# 6. Repeat the annotation, now including the novel enhancers

## 6.1 Annotate the MiSeq knockdown and time course sequencing data

Annotate all reads overlapping last exons, FANTOM5 enhancers, Roadmap Epigenomics enhancers and dyadic regions, and novel enhancers predicted from the HiSeq and CAGE data:
```
python make_annotation_scripts.py MiSeq
bash -x script.sh
```
This runs
```
python annotate.py MiSeq <library>
```
on each library. Store the `BAM` files generated by this script:
```
mv c91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv cel_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv lip_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv neg_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

Annotate remaining reads based on their association with genes:
```
python make_classify_gene_overlap_scripts.py MiSeq
bash -x script.sh
```
This runs
```
python classify_gene_overlap.py MiSeq <library>
```
on each library, adding the tags `XA:Z:sense_proximal`, `XA:Z:sense_upstream`, `XA:Z:sense_distal`, `XA:Z:prompt`, `XA:Z:CASPAR`, or `XA:Z:antisense_distal` to appropriate alignments, with the corresponding gene name under the `XG` tag and the distance in base pairs under the `XD` tag.

Store the `BAM` files generated by this script:
```
mv c91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv c91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv cel_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv gfi_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv lip_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv myb_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv n91_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv neg_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv nkd_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

## 6.4 Annotate the HiSeq sequencing data

Annotate all reads overlapping last exons, FANTOM5 enhancers, Roadmap Epigenomics enhancers and dyadic regions, and novel enhancers predicted from the HiSeq and CAGE data:
```
python make_annotation_scripts.py HiSeq
bash -x script.sh
```
This runs
```
python annotate.py HiSeq <library>
```
on each library. Store the `BAM` files generated by this script:
```
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

Annotate remaining reads based on their association with genes:
```
python make_classify_gene_overlap_scripts.py HiSeq
bash -x script.sh
```
This runs
```
python classify_gene_overlap.py HiSeq <library>
```
on each library, adding the tags `XA:Z:sense_proximal`, `XA:Z:sense_upstream`, `XA:Z:sense_distal`, `XA:Z:prompt`, `XA:Z:CASPAR`, or `XA:Z:antisense_distal` to appropriate alignments, with the corresponding gene name under the `XG` tag and the distance in base pairs under the `XD` tag.

Store the `BAM` files generated by this script:
```
mv t00_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t00_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t00_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t01_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t04_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t12_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t24_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r1.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r2.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
mv t96_r3.bam /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

## 6.5 Annotate the CAGE sequencing data

Annotate all reads overlapping last exons, FANTOM5 enhancers, Roadmap Epigenomics enhancers and dyadic regions, and novel enhancers predicted from the HiSeq and CAGE data:
```
python make_annotation_scripts.py CAGE
bash -x script.sh
```
This runs
```
python annotate.py CAGE <library>
```
on each library. Store the `BAM` files generated by this script:
```
mv 00_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_G.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_H.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_G.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 04_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 04_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 12_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 12_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 24_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 24_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

Annotate remaining reads based on their association with genes:
```
python make_classify_gene_overlap_scripts.py CAGE
bash -x script.sh
```
This runs
```
python classify_gene_overlap.py CAGE <library>
```
on each library, adding the tags `XA:Z:sense_proximal`, `XA:Z:sense_upstream`, `XA:Z:sense_distal`, `XA:Z:prompt`, `XA:Z:CASPAR`, or `XA:Z:antisense_distal` to appropriate alignments, with the corresponding gene name under the `XG` tag and the distance in base pairs under the `XD` tag.

Store the `BAM` files generated by this script:
```
mv 00_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_G.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 00_hr_H.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 01_hr_G.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 04_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 04_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 12_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 12_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 24_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 24_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_A.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_C.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
mv 96_hr_E.bam /osc-fs_home/mdehoon/Data/CASPARs/CAGE/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

## 6.6 Annotate the StartSeq sequencing data

Annotate all reads overlapping last exons, FANTOM5 enhancers, Roadmap Epigenomics enhancers and dyadic regions, and novel enhancers predicted from the HiSeq and CAGE data:
```
python make_annotation_scripts.py StartSeq
bash -x script.sh
```
This runs
```
python annotate.py StartSeq <library>
```
on each library. Store the `BAM` files generated by this script:
```
mv SRR707145?.bam /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

Annotate remaining reads based on their association with genes:
```
python make_classify_gene_overlap_scripts.py StartSeq
bash -x script.sh
```
This runs
```
python classify_gene_overlap.py StartSeq <library>
```
on each library, adding the tags `XA:Z:sense_proximal`, `XA:Z:sense_upstream`, `XA:Z:sense_distal`, `XA:Z:prompt`, `XA:Z:CASPAR`, or `XA:Z:antisense_distal` to appropriate alignments, with the corresponding gene name under the `XG` tag and the distance in base pairs under the `XD` tag.

Store the `BAM` files generated by this script:
```
mv SRR707145?.bam /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/Mapping/
```

Remove the intermediate files:
```
rm script.sh
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

# 7. Evaluate annotations

Create a table with the number of reads for each functional category in each library:
```
python make_annotation_table.py MiSeq
python make_annotation_table.py HiSeq
python make_annotation_table.py CAGE
python make_annotation_table.py StartSeq
```
generating the files `annotations.MiSeq.txt`, `annotations.HiSeq.txt`, `annotations.CAGE.txt`, and `annotations.StartSeq.txt`.

Make a figure with an overview of the annotation counts in each library:
```
python -i make_figure_annotations_timecourse.py
```
which writes the file `table_annotations_timecourse.txt`, and generates the figure `figure_annotations_timecourse.png`.

Create Supplementary Table S3 with the library counts for each functional category:
```
python make_supplementary_table_annotations.py
```
creating file `TableS3.txt`.

Make a table with the inferred RNA size for each of the functional categories:
```
python make_rna_size_table.py MiSeq
```
generating the file `rnasize.MiSeq.txt`. Draw a figure with the RNA size distribution for each of the functional categories:
```
python make_figure_rna_sizes.py
```
generating the figure `figure_rna_sizes_MiSeq.png`.


# 8. Create CTSS files and a gene information table

Create a CTSS file for each MiSeq, HiSeq, CAGE, and StartSeq library by running `create_ctss_file.py`, skipping multimapping reads as well as reads that mapped to mitochondrial DNA, histone mRNAs, small nucleolar RNAs (`snoRNA`), small nucleolar RNA precursors (`presnoRNA`), Cajal-body specific RNAs (`scaRNA`), small Cajal-body specific RNA precursors (`prescaRNA`), ribosomal RNA (`rRNA`), small nuclear RNAs (`snRNA`), small nuclear RNA precursors (`presnRNA`), small cytoplasmic RNAs (`scRNA`), transfer RNAs (`tRNA`), transfer RNA precursors (`pretRNA`), Ro-associated RNAs (`yRNA`), vault RNAs (`vRNA`), or small ILF3/NF90-associated RNAs (`snar`). Use the CTSS files for the CAGE libraries to create a gene information table.
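
As a rough illustration of what a CTSS file holds, the sketch below counts alignment 5' ends per position and strand and skips the small-RNA categories listed above. It is not taken from `create_ctss_file.py`: the input tuples, the name column, and the helper names are hypothetical, and the filters on multimapping reads, mitochondrial DNA, and histone mRNAs are left out.

```python
from collections import Counter

# Category labels quoted in the text above; how create_ctss_file.py stores and
# tests them is not shown in this document, so this set is only an illustration.
EXCLUDED = frozenset([
    "snoRNA", "presnoRNA", "scaRNA", "prescaRNA", "rRNA", "snRNA", "presnRNA",
    "scRNA", "tRNA", "pretRNA", "yRNA", "vRNA", "snar",
])


def five_prime_position(start, end, strand):
    """Return the 0-based position of the 5' end of an alignment spanning [start, end)."""
    return start if strand == "+" else end - 1


def count_ctss(alignments):
    """Count 5' ends per (chromosome, position, strand), skipping excluded categories."""
    counts = Counter()
    for chrom, start, end, strand, category in alignments:
        if category in EXCLUDED:
            continue
        counts[(chrom, five_prime_position(start, end, strand), strand)] += 1
    return counts


def to_bed_lines(counts):
    """Format the counts as 6-column BED lines; the name column is a hypothetical label."""
    lines = []
    for (chrom, pos, strand), n in sorted(counts.items()):
        name = f"{chrom}_{pos}_{strand}"
        lines.append(f"{chrom}\t{pos}\t{pos + 1}\t{name}\t{n}\t{strand}")
    return lines
```

Column 5 carries the tag count and column 6 the strand, which matches the `-c 4,5,6` column usage in the `bedtools merge` calls in this document.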

## 8.1 Create CTSS files for the MiSeq knockdown and time course data

```
python make_create_ctss_file_scripts.py MiSeq
bash -x script.sh
```
This will run
```
python create_ctss_file.py MiSeq <library>
```
on each library, generating the CTSS file `<library>.ctss.bed`.
Compress and store the CTSS files:
```
gzip ???_r?.ctss.bed
mkdir /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS
mv ???_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS
```
Remove the intermediate files:
```
rm script_MiSeq_*.sh
rm script_MiSeq_*.stdout
rm script_MiSeq_*.stderr
```

## 8.2 Create CTSS files for the HiSeq data

```
python make_create_ctss_file_scripts.py HiSeq
bash -x script.sh
```
This will run
```
python create_ctss_file.py HiSeq <library>
```
on each library, generating the CTSS file `<library>.ctss.bed`.
Compress and store the CTSS files:
```
gzip t??_r?.ctss.bed
mkdir /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS
mv t??_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS
```
Remove the intermediate files:
```
rm script_HiSeq_*.sh
rm script_HiSeq_*.stdout
rm script_HiSeq_*.stderr
```

## 8.3 Create CTSS files for the CAGE data, and a gene information table

```
python make_create_ctss_file_scripts.py CAGE
bash -x script.sh
```
This will run
```
python create_ctss_file.py CAGE <library>
```
on each library, generating the CTSS file `<library>.ctss.bed`.
Merge the CTSS files from the CAGE libraries:
```
cat ??_hr_?.ctss.bed | bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes | bedtools merge -s -d -1 -c 4,5,6 -o distinct,sum,distinct -i stdin > CAGE.ctss.bed
```
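For single-base CTSS intervals, the `bedtools merge -s -d -1 -c 4,5,6 -o distinct,sum,distinct` call above should amount to summing the counts of records that share chromosome, position, and strand, since `-d -1` requires intervals to overlap rather than merely touch. A minimal Python sketch of that behaviour, assuming 6-column input with integer counts (the function name is hypothetical, and the ordering of names within the `distinct` column may differ from bedtools):

```python
from collections import defaultdict


def merge_ctss(bed_lines):
    """Sum column 5 over records with identical chromosome, start, end, and strand."""
    totals = defaultdict(int)
    names = defaultdict(set)
    for line in bed_lines:
        chrom, start, end, name, count, strand = line.rstrip("\n").split("\t")[:6]
        key = (chrom, int(start), int(end), strand)
        totals[key] += int(count)
        names[key].add(name)
    merged = []
    for key in sorted(totals):
        chrom, start, end, strand = key
        # Distinct names are joined with commas, as the "distinct" operation does.
        merged.append((chrom, start, end, ",".join(sorted(names[key])), totals[key], strand))
    return merged
```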
Create a table with the CAGE tag count for each promoter, and the position with the highest CAGE tag count for each promoter:
```
python make_promoter_info_table.py 
```
creating the file `promoters.FANTOM_CAT.THP-1.bed`.
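
The two quantities in this table can be illustrated with a short sketch. It assumes a promoter given as a half-open interval and CTSS records given as tuples; both layouts, the function name, and the tie-breaking rule (first position encountered) are hypothetical and may differ from `make_promoter_info_table.py`.

```python
def summarize_promoter(promoter, ctss):
    """Return (total tag count, position with the highest tag count) for one promoter.

    promoter: (chrom, start, end, strand), with start inclusive and end exclusive.
    ctss: iterable of (chrom, position, strand, count).
    """
    chrom, start, end, strand = promoter
    total = 0
    best_position, best_count = None, 0
    for c, position, s, count in ctss:
        # Only CTSSs on the same chromosome and strand, inside the promoter, contribute.
        if c == chrom and s == strand and start <= position < end:
            total += count
            if count > best_count:
                best_position, best_count = position, count
    return total, best_position
```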

Create a table with the dominant promoter for each gene, and the distance of each gene to other genes:
```
python make_gene_info_table.py
```
generating the GFF file `genes.FANTOM_CAT.THP-1.gff`.

Create a promoter expression table:
```
python make_promoter_expression_table.py 
```
generating the file `promoters.FANTOM_CAT.THP-1.counts.txt`.

Create a gene expression table:
```
python make_gene_expression_table.py
```
generating the file `genes.FANTOM_CAT.THP-1.counts.txt`.

Compress and store the CTSS files:
```
gzip ??_hr_?.ctss.bed
mkdir /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS
mv ??_hr_?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS
```
Remove the intermediate files:
```
rm CAGE.ctss.bed
rm script_CAGE_*.sh
rm script_CAGE_*.stdout
rm script_CAGE_*.stderr
```

## 8.4 Create CTSS files for the StartSeq data

```
python make_create_ctss_file_scripts.py StartSeq
bash -x script.sh
```
This will run
```
python create_ctss_file.py StartSeq <library>
```
on each library, generating the CTSS file `<library>.ctss.bed`.
Compress and store the CTSS files:
```
gzip SRR707145?.ctss.bed
mkdir /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS
mv SRR707145?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS
```
Remove the intermediate files:
```
rm script_StartSeq_*.sh
rm script_StartSeq_*.stdout
rm script_StartSeq_*.stderr
```

# 9. Analyze sequences associated with genes

## 9.1 Analyze the MiSeq sequencing data

For each mapping target, create a scatter plot in which each dot represents one paired-end read. The horizontal axis shows the length of the target against which the transcript was mapped, and the vertical axis shows the length of the transcript inferred from the paired-end sequencing data. Dots corresponding to full-length transcripts will lie on the identity line (shown as a straight red line). The size selection limits are indicated as dashed red lines. Note that the size selection is not perfect; transcripts outside the selected size range may still be present in the library if their expression level is sufficiently high.

Using the time course data only:
```
python -i make_target_transcript_size_scatterplot_timecourse.py mRNA
```
This reports that there are 7123 mRNAs with truncated transcripts, and generates the figure `figure_scatter_sizes_mRNA_timecourse.png`.
```
python -i make_target_transcript_size_scatterplot_timecourse.py lncRNA
```
This reports that there are 1012 lncRNAs with truncated transcripts, and generates the figure `figure_scatter_sizes_lncRNA_timecourse.png`.

This suggests that there is a family of transcripts that overlap known mRNA or lncRNA transcript isoforms but are shorter: either their 5' end lies downstream of the promoter of the known mRNA or lncRNA, or their 3' end lies upstream of the annotated 3' end, i.e. the transcript is truncated compared to the full-length mRNA or lncRNA transcript.

Draw the distribution of the position of the 3' end with respect to mRNA and lncRNA exon-exon boundaries:
```
python -i make_figure_spliceboundary_timecourse.py
```
generating the figure `figure_spliceboundary_timecourse.png`.

This script also reports that 7170 coding genes and 768 non-coding genes have truncated transcripts, and that 2930 out of 7068 (41.5%) coding genes with spliced mRNAs and 141 out of 668 (21.1%) non-coding genes with spliced lncRNAs have significantly enriched termination at splice sites. Overall, termination at splice sites is highly significant (Fisher combined p-value < 1.e-100). The median size of RNAs terminating at splice sites was 216 nt, and the average percentage of short capped RNAs terminating at splice sites, for genes where the enrichment was significant, was 55.6% for coding genes and 73.6% for non-coding genes.
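
The Fisher combined p-value mentioned above presumably refers to Fisher's method, in which the statistic -2 Σ ln(p_i) over k independent tests follows a chi-square distribution with 2k degrees of freedom under the null hypothesis. A minimal sketch in plain Python is shown below; the function names are hypothetical, and whether `make_figure_spliceboundary_timecourse.py` computes it this way is an assumption.

```python
import math


def chi2_sf_even_df(x, k):
    """Survival function of the chi-square distribution with 2*k degrees of freedom.

    For even degrees of freedom this has the closed form
    exp(-x/2) * sum_{i=0}^{k-1} (x/2)**i / i!.
    """
    half = x / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def fisher_combined_pvalue(pvalues):
    """Combine independent p-values (each must be > 0) with Fisher's method."""
    statistic = -2.0 * sum(math.log(p) for p in pvalues)
    return chi2_sf_even_df(statistic, len(pvalues))
```

With thousands of genes and very small p-values, the exponential underflows and the partial sums can overflow, so a production version would work in log space. That is consistent with reporting only an upper bound such as `< 1.e-100`.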

Note that specific genes can show termination at a different site, as shown in figure `figure_termination_site_mRNA_DBI_timecourse.png` for DBI:
```
python -i make_figure_termination_site_timecourse.py mRNA NM_001079862.4
```

## 9.2 Analyze distribution around the transcription start site

Draw the profile with respect to the CAGE-annotated transcription start site of each gene:
```
python make_figure_promoter_distribution_timecourse.py
```
generating the figure `figure_promoter_distribution_timecourse.png`.
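
The core of such a profile is the strand-aware offset of each CTSS from the transcription start site. The sketch below accumulates tag counts by offset within a window; the function name, the input layout, and the window of 100 bp are hypothetical and not taken from `make_figure_promoter_distribution_timecourse.py`.

```python
from collections import Counter


def tss_profile(tss, strand, ctss, window=100):
    """Accumulate tag counts by offset from the TSS; positive offsets are downstream.

    ctss: iterable of (position, count) on the same chromosome and strand as the gene.
    """
    profile = Counter()
    for position, count in ctss:
        # On the minus strand, downstream means a smaller genomic coordinate.
        offset = position - tss if strand == "+" else tss - position
        if -window <= offset <= window:
            profile[offset] += count
    return profile
```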

# 10. Analyze concordance between MiSeq and HiSeq

Combine the MiSeq and HiSeq CTSSs into a single set of CTSSs, excluding time point 1 hr replicate 3, as it was replaced by a negative control sample for HiSeq:

```
zcat /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t00_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t01_r1.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t01_r2.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t04_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t12_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t24_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t96_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r?.ctss.bed.gz | bedtools sort | bedops --ec -u - | bedtools merge -s -d -1 -c 4,5,6 -o distinct,sum,distinct > merged.MiSeq_HiSeq.ctss.bed
```

generating the file `merged.MiSeq_HiSeq.ctss.bed`. Use paraclu to define peaks:
```
cat merged.MiSeq_HiSeq.ctss.bed | sort -k1,1 -k6,6 -k2,2n | cut -f 1,2,5,6 | while read col1 col2 col3 col4; do printf "$col1\t$col4\t$col2\t$col3\n"; done | paraclu 10 - | paraclu-cut.sh | while read col1 col2 col3 col4 col5 col6 col7 col8; do printf "$col1\t$col3\t$col4\t${col1}_${col3}-${col4}_${col2}\t$col6\t$col2\n"; done | bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > peaks.MiSeq_HiSeq.bed
```
generating a set of 151555 peaks.
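
The two `while read ...; do printf ...; done` loops in this pipeline only reorder columns. The first turns 6-column CTSS records into the four columns passed to paraclu (sequence name, strand, position, count). The second turns paraclu's cluster lines (sequence name, strand, first position, last position, number of sites, summed count, minimum density, maximum density) back into BED. The same reordering in Python, with hypothetical function names:

```python
def ctss_to_paraclu(bed_line):
    """Convert one 6-column CTSS BED line to paraclu input: chrom, strand, position, count."""
    fields = bed_line.rstrip("\n").split("\t")
    chrom, start, count, strand = fields[0], fields[1], fields[4], fields[5]
    return f"{chrom}\t{strand}\t{start}\t{count}"


def paraclu_to_bed(cluster_line):
    """Convert one paraclu cluster line to BED, mirroring the shell pipeline above."""
    chrom, strand, first, last, sites, total = cluster_line.split()[:6]
    # The peak name encodes chromosome, range, and strand, e.g. chr1_100-120_+.
    name = f"{chrom}_{first}-{last}_{strand}"
    return f"{chrom}\t{first}\t{last}\t{name}\t{total}\t{strand}"
```

As in the shell version, the first and last positions reported by paraclu are copied to the BED start and end columns without adjustment.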

Create a BED file for each library with the total expression of each promoter:
```
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t00_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t00_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t00_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t00_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t00_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t00_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t01_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t01_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t01_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t01_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t04_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t04_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t04_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t04_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t04_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t04_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t12_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t12_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t12_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t12_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t12_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t12_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t24_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t24_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t24_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t24_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t24_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t24_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t96_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t96_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t96_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t96_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/CTSS/t96_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > MiSeq.t96_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r1.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r2.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r3.ctss.bed.gz | bedtools intersect -a peaks.MiSeq_HiSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r3.expression.bed
```
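The 34 commands above differ only in platform, time point, and replicate, so they could also be generated rather than written out. A sketch of such a generator is given below; the function and constant names are hypothetical, while the paths and options are copied from the commands above. The resulting lines can be written to a shell script and run with `bash -x`, as elsewhere in this document.

```python
CHROM_SIZES = "/osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes"
CTSS_DIR = "/osc-fs_home/mdehoon/Data/CASPARs/{platform}/CTSS"
TIMEPOINTS = ("t00", "t01", "t04", "t12", "t24", "t96")
REPLICATES = ("r1", "r2", "r3")
# Time point 1 hr replicate 3 is excluded, as described above.
SKIPPED = {("t01", "r3")}


def expression_commands(platforms=("MiSeq", "HiSeq"), peaks="peaks.MiSeq_HiSeq.bed"):
    """Return one bedtools pipeline per library as a list of shell command strings."""
    commands = []
    for platform in platforms:
        for timepoint in TIMEPOINTS:
            for replicate in REPLICATES:
                if (timepoint, replicate) in SKIPPED:
                    continue
                library = f"{timepoint}_{replicate}"
                ctss = f"{CTSS_DIR.format(platform=platform)}/{library}.ctss.bed.gz"
                commands.append(
                    f"bedtools sort -g {CHROM_SIZES} -i {ctss}"
                    f" | bedtools intersect -a {peaks} -b stdin -wa -wb -s"
                    " | cut -f 1-4,11,12"
                    " | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct"
                    f" > {platform}.{library}.expression.bed"
                )
    return commands


if __name__ == "__main__":
    print("\n".join(expression_commands()))
```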
Merge these BED files into an expression table:
```
python merge_expression.py MiSeq HiSeq
```
generating the file `peaks.MiSeq_HiSeq.expression.txt`. Remove the intermediate BED files:
```
rm merged.MiSeq_HiSeq.ctss.bed
rm MiSeq.*.expression.bed
rm HiSeq.*.expression.bed
```
Create scatter plots comparing the count of each peak between corresponding MiSeq and HiSeq libraries:
```
python -i make_figure_miseq_hiseq_concordance.py
```
generating figure `figure_miseq_hiseq_concordance.png`.

The blue dot indicates DBI in each panel.
This script also calculates the global dispersion for each sample, which is shown above each panel, and as a bar graph in `figure_miseq_hiseq_dispersion.png`.
The average value of the global dispersion was 0.8240.

The mean Pearson correlation was 0.33. Note that this correlation value is low due to the limited sequencing depth of the MiSeq libraries. To get a corresponding correlation value for the sequencing depth of HiSeq libraries, which are used in the CAGE to HiSeq differential expression analysis, use
```
python -i simulate_correlation_miseq_hiseq.py 
```
This generates random data following the negative binomial distribution for the same TPM values as shown in `figure_miseq_hiseq_concordance.png`, and with the dispersion chosen such that the Pearson correlation at the MiSeq sequencing depth is 0.33. We find that the MiSeq libraries on average have 5348 counts in total, whereas the HiSeq libraries on average have 9683471 counts in total.
The correlation value, after log-transformation, of the generated data at the MiSeq read depth is 0.33; without log-transformation the correlation value is 0.81.
The correlation value, after log-transformation, of the generated data at the HiSeq read depth is 0.78; without log-transformation the correlation value is 0.88.
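
The idea behind this simulation can be sketched with the standard library alone. The code below is not `simulate_correlation_miseq_hiseq.py`: it assumes that the negative binomial is parametrized by a mean and a dispersion (variance = mean + dispersion × mean²), that the dispersion is positive, that a pseudocount of 1 is added before log-transformation, and that a rounded normal approximation is acceptable for large Poisson means. It also takes the dispersion as given instead of fitting it to the observed correlation of 0.33. All function names are hypothetical.

```python
import math
import random


def poisson(rng, lam):
    """Draw a Poisson variate; large means use a rounded normal approximation."""
    if lam > 30.0:
        return max(0, round(rng.gauss(lam, math.sqrt(lam))))
    # Knuth's multiplication method for small means.
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def negative_binomial(rng, mean, dispersion):
    """Gamma-Poisson mixture with variance mean + dispersion * mean**2."""
    if mean <= 0.0:
        return 0
    rate = rng.gammavariate(1.0 / dispersion, mean * dispersion)
    return poisson(rng, rate)


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_log_correlation(tpm, depth, dispersion, seed=0):
    """Correlation of log(count + 1) between two simulated libraries at a given depth."""
    rng = random.Random(seed)
    # Expected count of each peak is its TPM value scaled to the library depth.
    means = [value * depth / 1.0e6 for value in tpm]
    a = [math.log(negative_binomial(rng, m, dispersion) + 1) for m in means]
    b = [math.log(negative_binomial(rng, m, dispersion) + 1) for m in means]
    return pearson(a, b)
```

Calling `simulate_log_correlation` with the same TPM values and dispersion at depths of 5348 and 9683471 gives a rough comparison between the two read depths under these assumptions.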

Find the overlap between the MiSeq and HiSeq data and the DBI promoter:
```
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t00_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t00_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t00_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t00_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t00_r3.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t00_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t01_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t01_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t01_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t01_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t04_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t04_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t04_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t04_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t04_r3.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t04_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t12_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t12_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t12_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t12_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t12_r3.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t12_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t24_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t24_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t24_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t24_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t24_r3.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t24_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t96_r1.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t96_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t96_r2.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t96_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/MiSeq/Mapping/t96_r3.bam -b dbi.bed | samtools view -hu -f 64 | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > MiSeq.t96_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t00_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t00_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t00_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t00_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t00_r3.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t00_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t01_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t01_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t01_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t01_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t04_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t04_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t04_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t04_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t04_r3.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t04_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t12_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t12_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t12_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t12_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t12_r3.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t12_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t24_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t24_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t24_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t24_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t24_r3.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t24_r3.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t96_r1.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t96_r1.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t96_r2.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t96_r2.sam
bedtools intersect -u -s -a /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/Mapping/t96_r3.bam -b dbi.bed | samtools calmd - /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.fa > HiSeq.t96_r3.sam
```
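The MiSeq commands above add `samtools view -hu -f 64`, which keeps only alignments whose SAM flag has bit 64 (0x40, first read in a pair) set; the HiSeq commands have no such filter. A minimal sketch of the flag test, with a hypothetical function name:

```python
FIRST_IN_PAIR = 0x40  # decimal 64 in the SAM flag field


def is_first_in_pair(flag):
    """Return True if the SAM flag marks the alignment as the first read of a pair."""
    return (flag & FIRST_IN_PAIR) != 0


# Flag 83 = 1 + 2 + 16 + 64: paired, properly paired, reverse strand, first in pair.
print(is_first_in_pair(83))
```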
Draw the library fraction mapping to the DBI promoter in the MiSeq and HiSeq libraries, and the distribution of tags over the DBI promoter region in each of these libraries:

```
python -i make_figure_dbi_overrepresentation.py 
```
generating the figures `figure_dbi_overrepresentation.png` and `figure_dbi_overrepresentation_promoter.png`.

Remove the SAM files:
```
rm MiSeq.t??_r?.sam
rm HiSeq.t??_r?.sam
```

# 11. Analysis of gene-associated peaks

## 11.1 Analysis of CAGE+HiSeq peaks

Combine the HiSeq and CAGE CTSSs into a single set of CTSSs:
```
zcat /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/*.ctss.bed.gz | bedtools sort | bedops --ec -u - | bedtools merge -s -d -1 -c 4,5,6 -o distinct,sum,distinct > merged.ctss.bed
```
generating the file `merged.ctss.bed`. Use paraclu to define peaks:
```
cat merged.ctss.bed | sort -k1,1 -k6,6 -k2,2n | cut -f 1,2,5,6 | while read col1 col2 col3 col4; do printf "$col1\t$col4\t$col2\t$col3\n"; done | paraclu 10 - | paraclu-cut.sh | while read col1 col2 col3 col4 col5 col6 col7 col8; do printf "$col1\t$col3\t$col4\t${col1}_${col3}-${col4}_${col2}\t$col6\t$col2\n"; done | bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > peaks.bed
```
generating a set of 286343 peaks.

Create a BED file for each library with the total expression of each promoter:
```
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_A.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_G.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_G.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_H.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_H.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_A.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_G.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_G.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/04_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.04_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/04_hr_E.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.04_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/12_hr_A.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.12_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/12_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.12_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/24_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.24_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/24_hr_E.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.24_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_A.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_C.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_E.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r3.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r3.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r3.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r3.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r1.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r2.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r3.ctss.bed.gz | bedtools intersect -a peaks.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r3.expression.bed
```
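The 33 commands above differ only in the data set and library name. As a sketch, they could be generated from two lists instead of being written out by hand. The helper name `expression_command` and the variable names are hypothetical; the paths and the command text are copied from the commands above.

```python
CHROM_SIZES = "/osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes"
CTSS_PATH = "/osc-fs_home/mdehoon/Data/CASPARs/{dataset}/CTSS/{library}.ctss.bed.gz"

CAGE_LIBRARIES = [
    "00_hr_A", "00_hr_C", "00_hr_G", "00_hr_H",
    "01_hr_A", "01_hr_C", "01_hr_G",
    "04_hr_C", "04_hr_E",
    "12_hr_A", "12_hr_C",
    "24_hr_C", "24_hr_E",
    "96_hr_A", "96_hr_C", "96_hr_E",
]
HISEQ_LIBRARIES = [
    "t00_r1", "t00_r2", "t00_r3",
    "t01_r1", "t01_r2",
    "t04_r1", "t04_r2", "t04_r3",
    "t12_r1", "t12_r2", "t12_r3",
    "t24_r1", "t24_r2", "t24_r3",
    "t96_r1", "t96_r2", "t96_r3",
]


def expression_command(dataset, library, peaks="peaks.bed"):
    """Build the shell pipeline that sums the CTSS counts of one library per peak."""
    ctss = CTSS_PATH.format(dataset=dataset, library=library)
    return (
        f"bedtools sort -g {CHROM_SIZES} -i {ctss}"
        f" | bedtools intersect -a {peaks} -b stdin -wa -wb -s"
        " | cut -f 1-4,11,12"
        " | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct"
        f" > {dataset}.{library}.expression.bed"
    )


commands = [expression_command("CAGE", library) for library in CAGE_LIBRARIES]
commands += [expression_command("HiSeq", library) for library in HISEQ_LIBRARIES]
print("\n".join(commands))
```

Redirecting the printed output to a file and running it with `bash -x` would follow the pattern used for `script.sh` elsewhere in this document.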

Merge these BED files into an expression table:
```
python merge_expression.py
```
creating the file `peaks.expression.txt`. Remove the intermediate BED files:
```
rm CAGE.*.expression.bed
rm HiSeq.*.expression.bed
rm merged.ctss.bed
```
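The merging step can be pictured as a join of the per-library counts on the peak name. The sketch below is not the code of `merge_expression.py`; the function name, the table layout, and the choice to fill missing peaks with 0 are assumptions made for illustration.

```python
def merge_expression(per_library):
    """Combine per-library peak counts into one table.

    per_library maps a library name to a {peak name: count} dictionary.
    Peaks missing from a library get a count of 0 (an assumption of this sketch).
    Returns a header row followed by one row per peak, sorted by peak name.
    """
    libraries = sorted(per_library)
    peaks = sorted({peak for counts in per_library.values() for peak in counts})
    rows = [["peak"] + libraries]
    for peak in peaks:
        rows.append([peak] + [per_library[library].get(peak, 0) for library in libraries])
    return rows
```

Zero-filling matters because a peak without any overlapping CTSS in a library does not appear in that library's expression BED file.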
Perform differential expression analysis:
```
python perform_deseq.py
```
generating a file `peaks.deseq.txt` with the log-ratios and adjusted p-values at each time point, and across all time points. We use the glmGamPoi package to estimate the dispersion because the tag counts can be low. Reference:

Constantin Ahlmann-Eltze, Wolfgang Huber: "glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data." Bioinformatics 36(24):5701-5702 (2021). PMID 33295604.

Annotate each peak based on the annotations of each sequenced read:
```
python make_peak_annotation.py
```
This writes a new file `peaks.gff` with the annotation written in the feature column on each line. The -log10(adjusted pvalue) for differential expression between CAGE and HiSeq is stored in the attributes on each line, and the source column contains "HiSeq" if expression was higher in the HiSeq data compared to the CAGE data, and "CAGE" otherwise. The names of the genes associated with each peak, if any, are also stored as an attribute on each line.
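To make the column layout concrete, here is a hedged sketch of how one such GFF line could be assembled. The attribute keys `significance` and `genes`, the function name, and the conversion from 0-based BED coordinates are assumptions for illustration; the actual keys written by `make_peak_annotation.py` may differ.

```python
import math


def peak_to_gff(chrom, start, end, strand, annotation, higher_in_hiseq, padj, genes):
    """Format one peak as a tab-separated GFF line.

    start and end are 0-based half-open coordinates; GFF is 1-based inclusive.
    """
    source = "HiSeq" if higher_in_hiseq else "CAGE"
    # An adjusted p-value of exactly 0 has no finite -log10; report infinity.
    significance = float("inf") if padj == 0 else -math.log10(padj)
    attributes = [f"significance={significance:.2f}"]  # hypothetical key
    if genes:
        attributes.append("genes=" + ",".join(genes))  # hypothetical key
    fields = [chrom, source, annotation, str(start + 1), str(end),
              ".", strand, ".", ";".join(attributes)]
    return "\t".join(fields)
```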

Make a Venn diagram summarizing the differential expression results:
```
python -i make_figure_peak_venn_diagram.py
```
generating `figure_peak_venn_diagram.png`.

For each peak, find the transcription start site with the highest expression level:
```
python annotate_peaks_tss.py
```
which will write a new file `peaks.gff` with the TSS according to the CAGE and HiSeq data as attributes `CAGE_tss` and `HiSeq_tss`, respectively.
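Conceptually, this step picks the position with the largest tag count inside each peak. A minimal sketch, assuming the counts of one peak are available as a dictionary; the function name and the tie-breaking rule (smallest position wins) are assumptions, not taken from `annotate_peaks_tss.py`.

```python
def dominant_tss(ctss_counts):
    """Return the position with the highest tag count within one peak.

    ctss_counts maps a genomic position to its tag count. Ties are broken
    in favor of the smallest position (an assumption of this sketch).
    Returns None if the peak has no counts in this data set.
    """
    if not ctss_counts:
        return None
    return min(ctss_counts, key=lambda position: (-ctss_counts[position], position))
```

Running this once with the CAGE counts and once with the HiSeq counts would give the two values stored as `CAGE_tss` and `HiSeq_tss`.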

For each gene, find the associated peak that has the highest CAGE tpm expression averaged across the time course:
```
python annotate_main_peak.py
```
This writes a new file `peaks.gff` with a new attribute `dominant` that lists the names of the genes, if any, for which the peak is the dominant peak.
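The selection rule described above can be sketched as follows. The data structures and the function name are hypothetical; only the rule itself (highest tpm averaged across the time course) comes from the text.

```python
def dominant_peaks(peak_genes, peak_tpm):
    """Find the dominant peak of each gene.

    peak_genes maps a peak name to the list of associated gene names.
    peak_tpm maps a peak name to its tpm values across the time course.
    Returns a dictionary mapping each gene to its peak with the highest mean tpm.
    """
    best = {}
    for peak, genes in peak_genes.items():
        values = peak_tpm[peak]
        mean = sum(values) / len(values)
        for gene in genes:
            if gene not in best or mean > best[gene][1]:
                best[gene] = (peak, mean)
    return {gene: peak for gene, (peak, mean) in best.items()}
```

A peak associated with several genes can be dominant for more than one of them, which is why the `dominant` attribute lists gene names rather than a single name.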

Evaluate if capped long RNAs (as measured by CAGE) and capped short RNAs (as measured by HiSeq) originating from the same promoter are co-regulated during the time course:

```
python -i make_figure_timecourse.py
```
This generates two figures. The first figure, `figure_sense_timecourse_scatter.png`, shows, for each peak, the average CAGE expression across the time course plotted against the average HiSeq expression across the time course. The Pearson correlation across genes calculated from this scatter plot was 0.68 (p = 0).

The second figure, `figure_sense_timecourse_logratios.png`, shows a hexbin plot of the CAGE expression log-ratio and the HiSeq expression log-ratio during the time course (Spearman correlation = 0.29, Pearson correlation = 0.31; p = 0).
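For readers unfamiliar with the two correlation measures reported here, the following dependency-free sketch shows how they relate: the Spearman correlation is the Pearson correlation of the ranks. This is an illustration only; `make_figure_timecourse.py` may well rely on a statistics library instead, and the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because the Spearman correlation depends only on ranks, it is insensitive to monotonic transformations such as taking logarithms, whereas the Pearson correlation is not.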

## 11.2 Analysis of StartSeq+HiSeq peaks

Combine the HiSeq and StartSeq CTSSs into a single set of CTSSs:
```
zcat /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r?.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS/SRR7071452.ctss.bed.gz /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS/SRR7071453.ctss.bed.gz | bedtools sort | bedops --ec -u - | bedtools merge -s -d -1 -c 4,5,6 -o distinct,sum,distinct > merged.ctss.bed
```
generating the file `merged.ctss.bed`. Use paraclu to define peaks:
```
cat merged.ctss.bed | sort -k1,1 -k6,6 -k2,2n | cut -f 1,2,5,6 | while read col1 col2 col3 col4; do printf "$col1\t$col4\t$col2\t$col3\n"; done | paraclu 10 - | paraclu-cut.sh | while read col1 col2 col3 col4 col5 col6 col7 col8; do printf "$col1\t$col3\t$col4\t${col1}_${col3}-${col4}_${col2}\t$col6\t$col2\n"; done | bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes > peaks.HiSeq_StartSeq.bed
```
generating a set of 163282 peaks.

Create a BED file for each library with the total expression of each promoter:
```
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS/SRR7071452.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > StartSeq.SRR7071452.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/StartSeq/CTSS/SRR7071453.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > StartSeq.SRR7071453.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r3.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r3.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r3.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r3.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r1.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r2.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r3.ctss.bed.gz | bedtools intersect -a peaks.HiSeq_StartSeq.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r3.expression.bed
```
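Each pipeline above sums, per peak, the counts of all CTSSs that overlap the peak on the same strand. A plain-Python sketch of that computation is shown below, assuming 0-based half-open BED coordinates; the function name and tuple layouts are hypothetical, and the nested loop is quadratic, so this is meant for understanding rather than for genome-scale data.

```python
def peak_expression(peaks, ctss):
    """Sum CTSS counts per peak, requiring the same chromosome and strand.

    peaks: list of (chrom, start, end, name, strand).
    ctss: list of (chrom, start, end, count, strand).
    Peaks without any overlapping CTSS are absent from the result, as they
    are from the output of `bedtools intersect` without `-loj`.
    """
    totals = {}
    for chrom, start, end, name, strand in peaks:
        for c_chrom, c_start, c_end, count, c_strand in ctss:
            same_strand = chrom == c_chrom and strand == c_strand
            if same_strand and c_start < end and start < c_end:
                totals[name] = totals.get(name, 0) + count
    return totals
```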
Merge these BED files into an expression table:
```
python merge_expression.py HiSeq StartSeq
```
creating the file `peaks.HiSeq_StartSeq.expression.txt`. Remove the intermediate BED files:
```
rm HiSeq.*.expression.bed
rm StartSeq.*.expression.bed
rm merged.ctss.bed
```
Perform differential expression analysis:
```
python perform_deseq_startseq.py
```
generating a file `peaks.HiSeq_StartSeq.deseq.txt` with the log-ratio and adjusted p-value for each peak. We use the glmGamPoi package to estimate the dispersion because the tag counts can be low. Reference:

Constantin Ahlmann-Eltze, Wolfgang Huber: "glmGamPoi: fitting Gamma-Poisson generalized linear models on single cell count data." Bioinformatics 36(24):5701-5702 (2021). PMID 33295604.

Annotate each peak based on the annotations of each sequenced read:
```
python make_peak_annotation.py HiSeq StartSeq
```
This writes a new file `peaks.HiSeq_StartSeq.gff` with the annotation written in the feature column on each line. The -log10(pvalue) for differential expression between StartSeq and HiSeq is stored in the attributes on each line, and the source column contains "HiSeq" if expression was higher in the HiSeq data compared to the StartSeq data, and "StartSeq" otherwise. The names of the genes associated with each peak, if any, are also stored as an attribute on each line.

For each gene, find the associated peak that has the highest HiSeq tpm expression averaged across the time course:
```
python annotate_main_peak_startseq.py
```
This writes a new file `peaks.HiSeq_StartSeq.gff` with a new attribute `dominant` that lists the names of genes, if any, for which the peak is the dominant peak.

Make a Venn diagram summarizing the differential expression results:
```
python -i make_figure_peak_startseq_venn_diagram.py
```
generating `figure_peak_startseq_venn_diagram.png`, and reporting that 72712 peaks are expressed in the HiSeq data only, and 1519 in the StartSeq data only.

This script also draws a scatter plot (`figure_sense_startseq_scatter.png`) of the StartSeq tag count vs the HiSeq tag count for all dominant peaks, and reports that the Pearson correlation across genes calculated from this scatter plot was 0.44 (p = 0).


# 12. Analyze enhancer expression

Count the number of HiSeq enhancers that do not overlap CAGE enhancers:
```
intersectBed -v -a enhancer_predictions/enhancers.HiSeq.bed -b enhancer_predictions/enhancers.CAGE.bed | wc
```
which reports that 7779 HiSeq enhancers do not overlap CAGE enhancers.

Count the number of CAGE enhancers that do not overlap HiSeq enhancers:
```
intersectBed -v -a enhancer_predictions/enhancers.CAGE.bed -b enhancer_predictions/enhancers.HiSeq.bed | wc
```
which reports that 14663 CAGE enhancers do not overlap HiSeq enhancers.
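The two `intersectBed -v` calls above count the intervals in the first file that have no overlap with any interval in the second file, ignoring strand. A small sketch of that logic, assuming 0-based half-open BED coordinates; the function name is hypothetical and the quadratic scan is only suitable for small inputs.

```python
def count_non_overlapping(a, b):
    """Count intervals in a that overlap no interval in b, ignoring strand.

    Intervals are (chrom, start, end) tuples in 0-based half-open coordinates,
    so intervals that merely touch (end == start) do not overlap.
    """
    by_chrom = {}
    for chrom, start, end in b:
        by_chrom.setdefault(chrom, []).append((start, end))
    count = 0
    for chrom, start, end in a:
        others = by_chrom.get(chrom, [])
        if not any(start < o_end and o_start < end for o_start, o_end in others):
            count += 1
    return count
```

Swapping the two arguments gives the count in the other direction, mirroring the swapped `-a` and `-b` files in the two commands.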

Create a sorted BED file with all enhancers. The script `make_enhancer_joint_list.py` creates a BED file with 63285 FANTOM5 enhancers, 5688 enhancers discovered in HiSeq only, 12515 enhancers discovered in CAGE only, and 1785 enhancers discovered both in HiSeq and CAGE (83273 enhancers in total):
```
python make_enhancer_joint_list.py
```
generating the file `enhancers.bed`. To find the expression level of each enhancer, create a BED file that contains two windows for each enhancer: (center - 200, center) on the negative strand and (center + 1, center + 201) on the positive strand:
```
python make_enhancer_windows.py
```
creating the file `enhancers.windowed.bed`.
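The window definition can be written out as a short sketch. The function name, the BED6 tuple layout, and the placeholder score of 0 are assumptions for illustration; only the window coordinates are taken from the description above.

```python
def enhancer_windows(chrom, center, name, width=200):
    """Return the two strand-specific windows flanking an enhancer center.

    The negative-strand window is (center - width, center) and the
    positive-strand window is (center + 1, center + width + 1).
    """
    minus = (chrom, center - width, center, name, 0, "-")
    plus = (chrom, center + 1, center + width + 1, name, 0, "+")
    return [minus, plus]
```

With the default width, both windows span 200 bases and point away from the enhancer center, matching the bidirectional transcription expected at enhancers.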
Sort this list:
```
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i enhancers.windowed.bed > enhancers.sorted.bed
rm enhancers.windowed.bed
rm enhancers.bed
```
Create a BED file for each library with the total expression of each enhancer:
```
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_A.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_G.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_G.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/00_hr_H.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.00_hr_H.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_A.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/01_hr_G.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.01_hr_G.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/04_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.04_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/04_hr_E.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.04_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/12_hr_A.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.12_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/12_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.12_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/24_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.24_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/24_hr_E.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.24_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_A.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_A.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_C.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_C.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/CAGE/CTSS/96_hr_E.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > CAGE.96_hr_E.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t00_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t00_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t01_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t01_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t04_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t04_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t12_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t12_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t24_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t24_r3.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r1.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r1.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r2.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r2.expression.bed
bedtools sort -g /osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes -i /osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS/t96_r3.ctss.bed.gz | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s | cut -f 1-4,11,12 | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct > HiSeq.t96_r3.expression.bed
```
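The commands above differ only in the library name, so they can be generated rather than typed out. The following is a minimal sketch, not the script used here: the function name `expression_command` is hypothetical, and the list of time points and replicates covers only the libraries visible in this part of the document, so extend it to the full set as needed.

```python
# Paths copied from the commands above.
GENOME = "/osc-fs_home/scratch/mdehoon/Data/Genomes/hg38/hg38.chrom.sizes"
CTSS_DIR = "/osc-fs_home/mdehoon/Data/CASPARs/HiSeq/CTSS"


def expression_command(library):
    """Return the bedtools pipeline for one HiSeq library, e.g. 't01_r2'."""
    return (
        f"bedtools sort -g {GENOME} -i {CTSS_DIR}/{library}.ctss.bed.gz"
        " | bedtools intersect -a enhancers.sorted.bed -b stdin -wa -wb -s"
        " | cut -f 1-4,11,12"
        " | bedtools merge -s -d -1 -c 4,5,6 -o first,sum,distinct"
        f" > HiSeq.{library}.expression.bed"
    )


# Assumption: time points and replicates as seen above; the real list may be longer.
timepoints = ["t01", "t04", "t12", "t24", "t96"]
libraries = [f"{t}_r{r}" for t in timepoints for r in (1, 2, 3)]

for library in libraries:
    print(expression_command(library))
```

The printed lines can be redirected to a shell script and run with `bash -x`, in the same way as the other generated scripts in this document.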

Generate an expression table for enhancers:
```
python merge_expression_enhancers.py
```
generating the file `enhancers.expression.txt` with 45827 expressed enhancers.
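As an illustration of this merging step, the sketch below combines per-library files into one table with one row per enhancer and one column per library. It is not the code of `merge_expression_enhancers.py`; it assumes that each `HiSeq.<library>.expression.bed` file has six tab-separated columns (chromosome, start, end, enhancer name, summed tag count, strand), that tag counts are integers, and that enhancers missing from a library have a count of 0. The function name `merge_expression` is hypothetical.

```python
def merge_expression(tables):
    """Merge per-library expression lines into one table.

    `tables` maps a library name to an iterable of 6-column BED lines
    (chrom, start, end, name, count, strand); this layout is an assumption.
    Returns a list of rows; the first row is the header.
    """
    libraries = sorted(tables)
    counts = {}  # enhancer name -> {library: tag count}
    for library in libraries:
        for line in tables[library]:
            chrom, start, end, name, count, strand = line.rstrip("\n").split("\t")
            per_library = counts.setdefault(name, {})
            # Sum in case an enhancer appears on more than one line.
            per_library[library] = per_library.get(library, 0) + int(count)
    rows = [["enhancer"] + libraries]
    for name in sorted(counts):
        rows.append([name] + [counts[name].get(lib, 0) for lib in libraries])
    return rows


# Made-up example data, for illustration only.
example = {
    "t01_r2": ["chr1\t100\t500\tenhancer_1\t7\t+\n"],
    "t01_r3": ["chr1\t100\t500\tenhancer_1\t3\t+\n",
               "chr2\t50\t90\tenhancer_2\t1\t-\n"],
}
table = merge_expression(example)
for row in table:
    print("\t".join(str(value) for value in row))
```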

Perform differential expression analysis on this enhancer expression table:
```
python perform_deseq_enhancers.py
```
generating a file `enhancers.deseq.txt` with the log-ratios and adjusted p-values at each time point, and across all time points. Again we use the glmGamPoi package to estimate the dispersion, as the tag counts can be low.

Make a Venn diagram summarizing the differential expression results, and draw a scatter plot of the enhancer expression levels measured by CAGE and HiSeq:
```
python -i make_figure_enhancer_venn_diagram.py 
```
generating a Venn diagram in `figure_enhancer_venn_diagram.png` and a scatter plot in `figure_enhancer_scatterplot.png`.

The Pearson correlation was 0.44, with a calculated p-value of 0 (that is, too small to be represented as a floating-point number).
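A p-value of exactly 0 is a numerical underflow rather than a true zero. The sketch below illustrates this with the test statistic of a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)), and a normal approximation to its distribution. The number of data points `n` is an assumption (the number of expressed enhancers reported above); the number of points in the scatter plot is not stated here.

```python
import math

r = 0.44    # Pearson correlation reported above
n = 45827   # ASSUMPTION: number of data points; not stated for the scatter plot

# Test statistic of the Pearson correlation under the null hypothesis r = 0.
t = r * math.sqrt((n - 2) / (1.0 - r * r))

# Two-sided p-value using a normal approximation (reasonable for large n).
p_normal = math.erfc(abs(t) / math.sqrt(2.0))

# Asymptotic expansion of the normal tail, evaluated in log space so that
# the order of magnitude remains visible after the p-value itself underflows.
log_p = -t * t / 2.0 - math.log(t) - 0.5 * math.log(2.0 * math.pi) + math.log(2.0)
log10_p = log_p / math.log(10.0)

print(f"t = {t:.1f}")
print(f"p (double precision) = {p_normal}")
print(f"log10(p), approximate = {log10_p:.0f}")
```

Under these assumptions the p-value lies far below the smallest positive double-precision number (about 1e-308), so it is printed as 0.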

# 13. GWAS analysis of enhancers

Download all SNPs from dbSNP:
```
wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.38.gz
wget https://ftp.ncbi.nih.gov/snp/latest_release/VCF/GCF_000001405.38.gz.tbi
md5sum GCF_000001405.38.gz
```
which reports the MD5 sum `639c0414d291e65fb1f16f7cc29c7166`. The time stamp at NCBI is `2020-05-14 14:20`.
Convert this file to BCF format:
```
bcftools view GCF_000001405.38.gz -O b -o GCF_000001405.38.bcf
```
Save the output file:
```
mv GCF_000001405.38.bcf /osc-fs_home/scratch/mdehoon/Data/NCBI/dbSNP/
```
Now we can remove the VCF file and its index:
```
rm GCF_000001405.38.gz
rm GCF_000001405.38.gz.tbi
```
Index the `bcf` file:
```
bcftools index /osc-fs_home/scratch/mdehoon/Data/NCBI/dbSNP/GCF_000001405.38.bcf
```
Download the GWAS Catalog from https://www.ebi.ac.uk/gwas/api/search/downloads/full, and save it as
`gwas_catalog_v1.0-associations_e105_r2022-04-07.tsv`. Compress this file and save it:
```
bzip2 gwas_catalog_v1.0-associations_e105_r2022-04-07.tsv
mv gwas_catalog_v1.0-associations_e105_r2022-04-07.tsv.bz2 /osc-fs_home/scratch/mdehoon/Data/EBI/
```
Create a VCF file with these GWAS hits:
```
python make_gwas_vcffile.py 
```
generating the file `gwas.vcf`. Convert the VCF file to a BCF file:
```
bcftools view -O b -o gwas.bcf gwas.vcf
```
Sort the BCF file:
```
bcftools sort -O b -o gwas.sorted.bcf gwas.bcf 
mv gwas.sorted.bcf gwas.bcf 
```
Index the BCF file:
```
bcftools index gwas.bcf 
```
Analyze trait-associated GWAS enrichment:
```
python -i analyze_gwas.py
```
generating the figure `figure_enhancer_gwas.png`.
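As a rough illustration of the step that creates `gwas.vcf`, the sketch below writes a minimal VCF from GWAS Catalog rows. It is not the code of `make_gwas_vcffile.py`. It assumes the catalog columns `CHR_ID`, `CHR_POS`, and `SNPS`, prefixes chromosome names with `chr`, and uses `N` and `.` as placeholders for the reference and alternate alleles; the actual script may instead look these up in the dbSNP BCF file. The function name `gwas_rows_to_vcf` is hypothetical.

```python
def gwas_rows_to_vcf(rows):
    """Return the lines of a minimal VCF file for GWAS Catalog rows.

    `rows` is an iterable of dicts, e.g. from csv.DictReader(..., delimiter="\t").
    Column names CHR_ID, CHR_POS, and SNPS are assumptions about the catalog.
    """
    records = set()
    for row in rows:
        chrom, pos, rsid = row["CHR_ID"], row["CHR_POS"], row["SNPS"]
        if not chrom or not pos.isdigit():
            continue  # skip associations without a single mapped position
        records.add((f"chr{chrom}", int(pos), rsid))
    lines = ["##fileformat=VCFv4.2"]
    for chrom in sorted({record[0] for record in records}):
        lines.append(f"##contig=<ID={chrom}>")
    lines.append("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO")
    for chrom, pos, rsid in sorted(records):
        # REF "N" and ALT "." are placeholders, not real alleles.
        lines.append(f"{chrom}\t{pos}\t{rsid}\tN\t.\t.\t.\t.")
    return lines


# Made-up example rows, for illustration only.
example_rows = [
    {"CHR_ID": "1", "CHR_POS": "12345", "SNPS": "rs1"},
    {"CHR_ID": "", "CHR_POS": "", "SNPS": "rs2"},
]
vcf_lines = gwas_rows_to_vcf(example_rows)
print("\n".join(vcf_lines))
```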

# 14. Find the G-enrichment at the 5' end of tags

Count the nucleotide frequency at the 5' end of sequenced reads belonging to each functional category:
```
python make_find_g_enrichment_scripts.py MiSeq
bash -x script_MiSeq.sh
python make_find_g_enrichment_scripts.py HiSeq
bash -x script_HiSeq.sh
python make_find_g_enrichment_scripts.py CAGE
bash -x script_CAGE.sh
```
This runs
```
python -i find_g_enrichment.py <dataset> <library>
```
on each library of each data set, generating a file `firstnucleotide.<dataset>.<library>.txt` with the nucleotide counts (A, C, G, T, N) at the 5' end of sequenced reads, separated by functional category. Merge and store the results:
```
python merge_g_enrichment_results.py
```
generating a new file `firstnucleotide.txt`. Remove the intermediate files:
```
rm script_MiSeq.sh
rm script_HiSeq.sh
rm script_CAGE.sh
rm script_MiSeq_???_r?.sh
rm script_HiSeq_t??_r?.sh
rm script_CAGE_??_hr_?.sh
rm script_MiSeq_???_r?.stdout
rm script_HiSeq_t??_r?.stdout
rm script_CAGE_??_hr_?.stdout
rm script_MiSeq_???_r?.stderr
rm script_HiSeq_t??_r?.stderr
rm script_CAGE_??_hr_?.stderr
rm firstnucleotide.MiSeq.???_r?.txt
rm firstnucleotide.HiSeq.t??_r?.txt
rm firstnucleotide.CAGE.??_hr_?.txt
```
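The counting step above can be sketched as follows. This is a simplified illustration, not the code of `find_g_enrichment.py`: it assumes that each read has already been assigned a functional category and that its sequence is given in the orientation in which it was sequenced. The function name `count_first_nucleotides` and the category names in the example are hypothetical.

```python
from collections import Counter, defaultdict

NUCLEOTIDES = "ACGTN"


def count_first_nucleotides(reads):
    """Count the first nucleotide of each read, separately per category.

    `reads` is an iterable of (category, sequence) pairs.
    Any character other than A, C, G, or T is counted as N.
    """
    counts = defaultdict(Counter)
    for category, sequence in reads:
        if not sequence:
            continue  # skip empty sequences
        first = sequence[0].upper()
        if first not in "ACGT":
            first = "N"
        counts[category][first] += 1
    return counts


# Made-up example reads, for illustration only.
example_reads = [
    ("mRNA", "GATTACA"),
    ("mRNA", "gCCTA"),
    ("mRNA", "ACGT"),
    ("enhancer", "NTTT"),
]
counts = count_first_nucleotides(example_reads)
print("category\t" + "\t".join(NUCLEOTIDES))
for category in sorted(counts):
    row = [str(counts[category][nucleotide]) for nucleotide in NUCLEOTIDES]
    print(category + "\t" + "\t".join(row))
```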

Run
```
python -i make_figure_g_enrichment_timecourse.py
```
to draw a stacked bar graph showing the distribution of the first nucleotide of sequences of the different categories, separately for each data set and library. This will generate the figure `figure_g_enrichment_timecourse.png`.

# 15. Analyze ChIP-Seq data

Analyze ChIP-Seq data to find the H3K4me1, H3K4me2, and H3K4me3 enrichment around predicted enhancers:
```
python -i make_figure_enhancer_chipseq.py
```
generating the figure `figure_enhancer_chipseq.png`.

This script also reports the Mann-Whitney enrichment p-values:
- H3K4me1, short capped RNAs: Mann-Whitney p = 5.73159e-120
- H3K4me1, long capped RNAs: Mann-Whitney p = 4.83358e-169
- H3K4me3, short capped RNAs: Mann-Whitney p = 3.13064e-187
- H3K4me3, long capped RNAs: Mann-Whitney p = 0 (too small to be represented as a floating-point number)

The H3K4me1/H3K4me3 enrichment was significantly higher for short capped RNAs than for long capped RNAs, by a factor of 1.337754 (p-value = 0.0011659 based on a Z-test).
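For reference, a Z-test on a ratio can be sketched as below: the logarithm of the ratio is divided by its standard error, and the resulting z-score is converted to a two-sided p-value under the standard normal distribution. How the script actually computes the standard error is not described here, so the value `0.09` in the example is purely hypothetical, as are the function names; whether the reported p-value is one-sided or two-sided is also an assumption.

```python
import math


def two_sided_p_from_z(z):
    """Two-sided p-value of a z-score under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def z_test_log_ratio(ratio, se_log_ratio):
    """Z-test of the null hypothesis ratio = 1, performed on the log scale."""
    z = math.log(ratio) / se_log_ratio
    return z, two_sided_p_from_z(z)


# 1.337754 is the factor reported above; 0.09 is a HYPOTHETICAL standard error.
z, p = z_test_log_ratio(1.337754, 0.09)
print(f"z = {z:.3f}, two-sided p = {p:.3g}")
```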

# 16. Analyze enhancer activity from reporter assay data

Download Supplementary Table S3 from Robin Andersson's paper, and store it in
`/osc-fs_home/mdehoon/Data/Fantom5/Robin/TableS3.xlsx`.

Create the BED file `reporters.bed` with the genome coordinates for human
genome assembly hg19:
```
python create_reporter_assay_bedfile.py 
```
Convert this BED file to hg38 coordinates:
```
liftOver reporters.bed /osc-fs_home/mdehoon/Data/UCSC/hg19ToHg38.over.chain.gz reporters.hg38.bed unmapped
```
The `unmapped` file is empty.

Regenerate the set of 83321 HiSeq, CAGE, and FANTOM5 enhancers:
```
python make_enhancer_joint_list.py
```
creating the file `enhancers.bed`.

Find the overlap between the enhancer set and the reporter locations:
```
intersectBed -a enhancers.bed -b reporters.hg38.bed -wa -wb > overlap.txt
```
This finds 117 overlaps. Note that some of these were defined in FANTOM5, and are not expressed in our CAGE or HiSeq data sets.
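The overlap criterion used in this step can be written down in a few lines. The sketch below assumes BED-style half-open intervals (0-based start, exclusive end) and ignores strand, since the command above does not pass `-s`. It compares all pairs of intervals, which is adequate only for small inputs; the function names are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base.

    Intervals are half-open, as in BED files: start is 0-based, end is exclusive.
    """
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def intersect(enhancers, reporters):
    """Return all (enhancer, reporter) pairs that overlap, one pair per overlap."""
    return [(e, r) for e in enhancers for r in reporters if overlaps(e, r)]


# Made-up example intervals, for illustration only.
enhancers = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
reporters = [("chr1", 150, 250), ("chr1", 200, 300), ("chr2", 0, 100)]
pairs = intersect(enhancers, reporters)
print(len(pairs), "overlaps")
```

In this example only the first enhancer overlaps a reporter region; intervals that merely touch, such as `chr1:100-200` and `chr1:200-300`, do not count as overlapping.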

Make a figure with the percentage of enhancers that have significant reporter activity:
```
python -i make_figure_enhancer_reporters.py 
```
generating the figure `figure_enhancer_reporter.png`.

The percentages are 78.57% and 63.16%; the odds ratio from Fisher's exact test is 2.14, with p = 0.27839.
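As an illustration of how such numbers arise from a 2x2 table, the sketch below computes the sample odds ratio and a one-sided Fisher exact p-value from the hypergeometric distribution. The counts used (11 of 14 and 12 of 19) are hypothetical: they were chosen only because they give the same percentages, the actual counts are not stated here, and the p-value of this one-sided sketch need not equal the reported one. The function name `fisher_exact_greater` is also hypothetical.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher exact test for the 2x2 table [[a, b], [c, d]].

    Returns (sample odds ratio, p-value), where the p-value is the probability
    of observing a or more in the top-left cell given fixed margins.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    k_max = min(row1, col1)
    tail = sum(comb(row1, k) * comb(row2, col1 - k) for k in range(a, k_max + 1))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, tail / total


# HYPOTHETICAL counts: 11 of 14 active in one group, 12 of 19 in the other.
active_1, inactive_1 = 11, 3
active_2, inactive_2 = 12, 7
odds_ratio, p_value = fisher_exact_greater(active_1, inactive_1, active_2, inactive_2)
print(f"{100 * active_1 / (active_1 + inactive_1):.2f}% vs "
      f"{100 * active_2 / (active_2 + inactive_2):.2f}%")
print(f"odds ratio = {odds_ratio:.2f}, one-sided p = {p_value:.4f}")
```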
